Viral codon usage is shaped by the conflicting forces of mutational pressure and selection to match host patterns for 
optimal expression. We examined whether genomic architecture (single- or double-stranded DNA) influences the degree 
to which bacteriophage codon usage differs from that of their primary bacterial hosts and from each other. While both correlated 
equally with their hosts' genomic nucleotide content, the coat genes of ssDNA phages were less well adapted than those 
of dsDNA phages to their hosts' codon usage profiles due to their preference for codons ending in thymine. No specific 
biases were detected in dsDNA phage genomes. In nine of the ten cases of codon redundancy in which a specific codon 
was overrepresented, ssDNA phages favored the NNT codon. A cytosine to thymine biased mutational pressure working 
in conjunction with strong selection against non-synonymous mutations appears to be shaping codon usage bias in ssDNA 
viral genomes. 



Introduction 

Viruses usually exhibit genomic signatures that closely mimic 
those of their primary hosts, 1,2 in part to better evade innate and 
acquired immune responses. 3,4 However, the majority of the close 
adherence to host nucleotide usage is attributed to selection for 
improved translational speed and efficiency, which are correlates 
of viral fitness. Synonymous codons are used at different frequen- 
cies in virtually all organisms, 5,6 and the most frequently used 
codons correlate with the most abundant tRNAs within a cell. 7,8 
These favored synonymous codons are therefore recognized 9 and 
translated 10-12 more rapidly. The most frequently expressed cellular
genes within a given organism exhibit similar patterns of
this codon usage bias (CUB) and are more biased than less frequently
expressed genes. 11,13-16 For viruses, these factors should
contribute to an increased rate of replication when strictly adhering
to host CUB. Therefore many viruses have been under selective 
pressure to match the CUB of their preferred hosts. 17 Despite 
increased attention to the genomic match between viruses and 
their hosts, there have been few studies examining how different 
viral genomic architectures facilitate or hinder adaptation to their 
hosts' genomes. 

Phages are the optimal system in which to explore how 
genomic architecture affects viral molecular evolution. The 
codon bias expressed in prokaryotic hosts is constant for each 
host cell, unlike multi-cellular organisms, in which codon usage 
profiles are affected by tissue-specific gene expression. 18 Perhaps 
due to this, phage are more strongly adapted to their primary 
hosts' CUB than eukaryotic viruses, 19 allowing the greatest 
potential to identify factors that diminish the match between 
virus and host genomes. Bacterial hosts also offer a wider range of 
genomic nucleotide content to examine compared with plant or 
mammalian hosts, and their CUB have been well-documented. 
Additionally, while phage host ranges are far from perfectly 
annotated, bacteriophage host ranges are usually quite narrow 20 
and many of their host ranges have been better delineated than 
those of eukaryotic viruses, such as phytopathogens. 19 

Two distinct phage genomic architectures (single-stranded 
DNA, ssDNA and double-stranded DNA, dsDNA) have been 
amply sequenced; unfortunately, the small number of sequenced 
RNA phages precludes their close examination at this time. The 
two DNA-based architectures are subject to specific constraints: 
dsDNA phages can house the largest genomes, up to ~300 kb, 21,22 
whereas even the largest ssDNA phages are smaller than 10 kb. 23 
Many dsDNA phages encode their own tRNAs, (e.g., T4 encodes 
eight 24 ), decreasing selection for adherence to host CUB, whereas 
none have been found in ssDNA phages. dsDNA phages have the 
lowest mutation rates among viruses, while ssDNA phage mutation
rates are faster, approaching those of a dsRNA phage. 25,26 
Eukaryotic viruses with the same ssDNA genomic architecture 
exhibit evolutionary rates orders of magnitude above those seen 
in eukaryotic dsDNA viruses. 27 Consequently, faster-evolving 
ssDNA phages might be better able to adapt to host-imposed 
genomic conditions. Conversely, the mutation frequency in 
ssDNA phages may diminish their 
ability to conform to their host codon 
preferences. 

Genomic GC content is a rough 
predictor of CUB, and many viruses 
match the GC content of their hosts. 28-32 
Bacteriophage GC content, in particular,
correlates strongly to that of 
their primary bacterial hosts. 33 We 
measured the similarity in GC con- 
tent between each ssDNA and dsDNA 
GenBank phage reference genome 
and that of its primary host. We used 
the most numerous group of phages 
with a common host, Escherichia coli, 
to compare codon adaptation indices 
(CAI) and relative synonymous codon 
usage (RSCU) for a subset of highly 
expressed genes from dsDNA and 
ssDNA coliphages. 1 Our results show 
that genomic architecture correlates 
to statistically significant differences 
in nucleotide content and codon usage 
between ssDNA and dsDNA phages, 
and point to an enrichment of thymine 
as a cause. 



Figure 1. Correlation between host and phage genomic GC content. Grey squares indicate dsDNA 
phages, open squares ssDNA. Best-fit linear regression lines are solid for dsDNA (r 2 = 0.84) and dashed 
for ssDNA (r 2 = 0.82). There was no significant difference between the correlations (p = 0.72). 

Figure 2. Mean coat gene CAI with 95% confidence intervals of ssDNA 
(n = 11) and dsDNA (n = 34) coliphages. 



GC content in ssDNA and dsDNA 
phages was highly correlated with 
host GC content (r 2 = 0.82 for ssDNA phages, 0.84 for dsDNA 
phages; no significant difference between the correlations, p = 0.72) across a very wide range 
of host GC content (~0.25 to ~0.72) (Fig. 1). A previous study 
found significant differences between ssDNA and dsDNA phage 
nucleotide correlation with their hosts, 33 but the additional 333 
dsDNA and 13 ssDNA reference sequences added to GenBank 
since that analysis suggest there is no difference (Table S1 and 
Fig. S1). ssDNA phages exhibited a pronounced genomic thymine
bias (average 0.30 T), but nonetheless infected hosts with 
a range of GC contents (0.25 to 0.70), as wide as that of dsDNA 
phages (0.26 to 0.72). 

Correlated GC content was a poor predictor of strong CAI 
match between E. coli and the coat genes of its phages. The mean 
CAI of ssDNA coliphages was 0.706, while the dsDNA phages 
were significantly better matched to E. coli (0.744, p < 0.001, 
Fig. 2). This number includes eight dsDNA coliphage genomes 
for which tail protein encoding genes were used, rather than coat 
protein encoding genes, due to the absence of properly annotated 
coat genes. The inclusion of tail genes did not change the results 
of this analysis (p < 0.001 with and without the eight tail genes). 
The evidence of selection for translational efficiency is stronger 
for dsDNA phages. 

Comparison of the GC content of the first two positions of 
each codon (GC1,2) and the third position (GC3) of these genes 
revealed an interesting pattern: for both ssDNA and dsDNA 
coliphages, the GC1,2 was restricted to a tight range between 
about 0.45 and 0.55. dsDNA GC3 varied along a wide range, 
from 0.26 to 0.69, but ssDNA GC3 occupied a narrower range, 
from 0.30 to 0.54 (Fig. 3). Furthermore, when plotted with a line 
representing a perfect correlation between GC1,2 and GC3, all 
but one of the ssDNA phages fell to the left of that line (Fig. 3), 
indicating a paucity of GC in the third codon position of their 
coat genes. Conversely, the dsDNA coat genes were GC3-rich 
or GC3-poor in approximately equal numbers. Past studies have 
indicated that strong mutational biases often occur with low levels
of CUB, 34-36 possibly because a strong, non-specific mutational 
pressure would prevent any persistent, directional changes in the 
genome. The consistently lower GC3 content of the ssDNA genes 
suggests that a specific mutational pressure might be reducing 
GC3 content in a directional manner, which is disrupting the 
effects of selection for translational efficiency. 

We further investigated the GC3-poor nature of ssDNA 
coliphage coat proteins with RSCU analysis. It revealed statis- 
tically significant variation in use for 15 of 59 codons between 
ssDNA and dsDNA phage (p < 0.03 for TTG, p < 0.002 for 
CTT and TCC, p < 0.001 for all other codons, Fig. 4). Notably, 
for four of the five codons more frequently used by ssDNA rather 
than dsDNA coliphages, thymine was in the third position. No 
codons enriched in dsDNA phage relative to ssDNA phage con- 
tained thymine in the third positions. 

Calculation of RSCUs of coat genes in 28 ssDNA phages with 
a diverse host range confirmed this pattern: codons with thymine 
in the third position were extremely overrepresented (p < 0.001) 
for six amino acids (A, D, G, I, T, V), and were significantly 
favored (p < 0.012) in three more (H, P, S) (Fig. 5). Only one 
of the remaining nine degenerate amino acids had a statistically 
preferred codon in ssDNA phages (GAA for E, p < 0.01). 

We subdivided our data set to separately examine the two mor- 
phologically distinct families of ssDNA phages, the Inoviridae 
and the Microviridae. Because inoviruses are 
frequently vertically transmitted and can pro- 
ductively infect their hosts without causing 
lysis, they might be under increased selective 
pressure to match the genomes of their more 
permanently associated hosts. RSCU compari- 
sons revealed no consistent patterns associated 
with phage lifestyle. No difference in RSCU 
was evident for 11 of the 16 NNT codons in 
these groups (Fig. S2). 

Cytosines are comparatively unstable and 
readily undergo spontaneous deamination 
to uracil, resulting in C to T transitions after 
unrepaired replication. 37 This spontaneous 
deamination occurs 100 times more frequently 
in ssDNA than dsDNA, resulting in a higher 
mutation rate at cytosines 38 than at other bases 
in ssDNA phage. 35 ssDNA phage genomes 
appear to spend more time truly single- 
stranded, as they do not experience consistent 
intra-strand base pairing or regular secondary 
structure formation while encapsidated. 40-45 

Figure 3. GC1,2/GC3 correlation for ssDNA (open squares) and dsDNA 
(gray squares) coliphage coat genes. Solid line indicates perfect correla- 
tion. Points above the line indicate genes deficient in GC3, points below 
denote genes enriched in GC3. 



This causes ssDNA phages to more frequently have unpaired 
bases than ssRNA genomes, which are constrained by extensive 
stem-loop formation both in the cytosol and when encapsidated. 46 
Any thymine-increasing bias does not appear to have a dis- 
cernible effect on genomic nucleotide content relative to the 
phages' primary hosts. Rather, it is likely that cytosine transitions 
in the first or second positions are subject to strong purifying 
selection relative to the wobble position, 47-49 and the signature 
of this mutational bias is only observed in the overabundance 
of thymine in the third position of synonymous ssDNA phage 
codons. The significant overrepresentation of NNT codons is 
strongly indicative of a biased mutational pressure acting in concert
with strong selection against non-synonymous substitutions. 

Figure 4. Mean RSCU values and 95% confidence intervals for individual codons with 
statistically significant differences in usage between ssDNA (open squares) and dsDNA (gray 
squares) coliphage coat or tail genes. 

Figure 5. RSCU values and 95% confidence intervals for ssDNA phage coat gene codons that exhibited an NNT codon preference. Preferred NNT 
codons indicated by bold triangles, NNV codons indicated by squares. 

Genomic architecture (nucleic acid, segmentation, stranded- 
ness), while acknowledged as an important characteristic of virus 
taxonomy, is not typically included in broad-scale analyses of viral 
evolution. Instead, most comparisons focus within a single kind 
of virus, 50 and while many of these studies have provided insight 
into the codon usage biases of individual viruses, this is the first 
observation of a specific bias with a possible mechanistic explana- 
tion. Examining across two architectures, we saw strandedness 
play a critical role in the composition of phage genomes, and in 
determining the limits of ssDNA viral adaptation to their hosts. 

Materials and Methods 

All available ssDNA and dsDNA bacteriophage genome refer- 
ence sequences were collected from GenBank on March 16, 
2011. Reference sequences were used to avoid biasing our data 
sets toward any particular phage species, or highly studied phage, 
such as the model organisms PhiX174 or T7. These genomes were 
separated according to genomic architecture for further analy- 
sis. Initially collected were 41 ssDNA phages and 447 dsDNA 
phages (Table S2). For each phage having a known host with a 
sequenced genome (GenBank reference sequence), the relation- 
ship between the GC content of the phage and the host bac- 
terium was examined. Because not every sequenced phage has 
an identified and sequenced host, not all phages were included 
in this analysis. Four ssDNA phages were excluded, as were 44 
dsDNA phages (Table S2). 
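
The GC comparison described above can be illustrated with a minimal Python sketch. It assumes each genome is available as a plain nucleotide string; the function names `gc_content` and `r_squared` are hypothetical and are not taken from any tool used in this study. The r 2 here is the ordinary coefficient of determination for a simple linear regression.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return gc / total


def r_squared(xs, ys) -> float:
    """Coefficient of determination for a simple linear regression of ys on xs."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("a series with zero variance has no defined correlation")
    return (sxy * sxy) / (sxx * syy)
```

Given paired lists of host and phage GC fractions for one architecture, `r_squared(host_gc, phage_gc)` would give a value comparable in kind to those reported in Figure 1.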

The codon usage biases of representative ssDNA and dsDNA 
phages were examined to gain a more complete picture of the 
CUB patterns in both architectures. Codon usage profiles were 
determined using major coat/capsid genes, or, in the eight cases 
for which coat genes were not available, tail gene sequences 
retrieved from GenBank reference genomes (Table S3). These 
structural proteins are highly expressed and exhibit the highest 
degrees of codon usage bias found in phage. 51,52 We compared 
codon usage between the two genomic architectures for phages 
infecting a single host: Escherichia coli. Coat or tail genes from 11 



ssDNA and 34 dsDNA phages were used (Table S3). The online 
CAIcal tool 53 was used to calculate each phage's codon adapta- 
tion index (CAI), a measure of the degree to which one gene or 
set of genes adheres to the CUB of another gene or set of genes, 1 
as implemented by Xia. 54 CAI ranges from zero to one; values 
closer to one indicate a strong correlation. The average CAI was 
calculated for both architectures. 
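
To make the CAI calculation concrete, the following Python sketch computes the index as the geometric mean of relative adaptiveness values, the classic formulation of the measure. It is not CAIcal's code and does not reproduce the Xia implementation cited above. The function names, the use of the standard genetic code, and the 0.5 pseudocount for codons absent from the reference set are all assumptions of this sketch.

```python
from collections import Counter, defaultdict
from itertools import product
from math import exp, log

# Standard genetic code, codons enumerated in TCAG order (assumed here).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons(seq: str):
    """Split a coding sequence into in-frame codons, dropping a trailing partial codon."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_genes):
    """w for each degenerate codon: its count divided by the top count in its family."""
    counts = Counter(c for g in reference_genes for c in codons(g) if c in CODON_TABLE)
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)
    w = {}
    for aa, family in families.items():
        if aa == "*" or len(family) == 1:
            continue  # stop codons, Met and Trp carry no synonymous choice
        top = max(counts[c] for c in family)
        if top == 0:
            continue  # amino acid never seen in the reference set
        for c in family:
            # Assumption: unseen codons get a 0.5 pseudocount to avoid log(0).
            w[c] = max(counts[c], 0.5) / top
    return w


def cai(gene: str, w) -> float:
    """Geometric mean of w over the scorable codons of a gene."""
    logs = [log(w[c]) for c in codons(gene) if c in w]
    if not logs:
        raise ValueError("gene has no codons covered by the reference weights")
    return exp(sum(logs) / len(logs))
```

In this sketch the reference set would be highly expressed host genes and the query a phage coat or tail gene, so a value near one means the phage gene uses the host's preferred codons almost exclusively.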

Frequency of GC in the first and second codon positions (GC1,2) 
and frequency of GC in the third position (GC3) were calculated
for these genes using CAIcal and the relationship between the 
two was analyzed. A plot of GC1,2 against GC3 is a common 
measure of the factors affecting CUB in a gene or set of genes; 
a strong correlation between the two implies that genome-wide 
mutational pressures are the driving force behind CUB, while a 
weaker correlation indicates that some force is unequally affect- 
ing the first two positions and the third position. Usually, this 
is interpreted as implying a selective force acting on CUB, as is 
expected to be the case for viruses under relatively strong selec- 
tion for translational speed. 
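
The two positional GC measures can be sketched in a few lines of Python. The function name `gc12_gc3` is hypothetical, and the sketch counts every in-frame codon. Whether CAIcal excludes particular codons (for example stop codons) from these statistics is not stated here, so results may differ slightly from that tool.

```python
def gc12_gc3(cds: str):
    """Return (GC1,2, GC3) for an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    first_two = [cds[i] for i in range(usable) if i % 3 != 2]
    third = [cds[i] for i in range(2, usable, 3)]

    def gc_fraction(bases):
        known = [b for b in bases if b in "ACGT"]
        if not known:
            raise ValueError("no unambiguous bases at these positions")
        return sum(b in "GC" for b in known) / len(known)

    return gc_fraction(first_two), gc_fraction(third)
```

A gene with GC3 well below its GC1,2, as in most of the ssDNA coat genes in Figure 3, is one whose third codon positions are depleted in G and C relative to the first two.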

To examine the variation in codon usage that contributes 
to the differing CAI values and site-specific base composi- 
tions, relative synonymous codon usage (RSCU) values were 
calculated for the same sets of genes using CAIcal. RSCU is a 
measure of the relative codon usage for each individual degen- 
erate amino acid compared with expected levels if synonymous 
codons were used with equal frequency. An RSCU of about one 
indicates that a codon is used as frequently as expected, while 
values above or below one indicate over or underuse of that syn- 
onymous codon, respectively. Mean dsDNA coliphage RSCUs 
were compared with ssDNA coliphage RSCU to determine the 
proximate cause of the observed variation in CAI. RSCU was 
also calculated for 17 additional sufficiently well-annotated 
genomes of ssDNA phages infecting a wide host range (primarily 
infecting Acholeplasma, Bdellovibrio, Chlamydia, Escherichia, 
Propionibacteria, Pseudomonas, Ralstonia, Spiroplasma, Vibrio 
and Xanthomonas, Table S4), and the complete set of 28 ssDNA 
phage RSCUs was assessed for consistent CUB. For amino acids 
with 6-fold redundancy (L, R, S), RSCUs were calculated sep- 
arately for the codon sets with 4-fold and 2-fold redundancy. 
Significantly biased codon use was measured for each codon 
with one-tailed t-tests (Microsoft Excel) and Bonferroni correction
for multiple comparisons (α = 0.017 for 4-fold, α = 0.025 
for 3-fold). 
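
The RSCU definition above, including the separate treatment of the 4-fold and 2-fold codon sets of L, R and S, can be sketched as follows. This is an illustrative Python version rather than CAIcal's code; the function names and the use of the standard genetic code are assumptions of the sketch.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumed here).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_families():
    """Group codons by amino acid, splitting L, R and S by their first two bases."""
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue
        key = (aa, codon[:2]) if aa in "LRS" else (aa, "")
        families[key].append(codon)
    # Keep only families that offer a synonymous choice (drops Met and Trp).
    return [fam for fam in families.values() if len(fam) > 1]


def rscu(gene: str):
    """Observed codon count divided by the count expected under equal synonymous use."""
    gene = gene.upper()
    in_frame = [gene[i:i + 3] for i in range(0, len(gene) - len(gene) % 3, 3)]
    counts = Counter(c for c in in_frame if c in CODON_TABLE)
    values = {}
    for family in synonymous_families():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # family absent from this gene, RSCU undefined
        expected = total / len(family)
        for c in family:
            values[c] = counts[c] / expected
    return values
```

With this grouping there are 59 scorable codons, matching the count used in the coliphage comparison. The thresholds quoted above are consistent with dividing 0.05 by the number of codons in a set minus one (0.05/3 ≈ 0.017 and 0.05/2 = 0.025), although that reading is an inference rather than something stated in the text.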